Abstract



We introduce generalized notions of a divergence function and a Fisher 
information matrix. We propose to generalize the notion of an exponential 
family of models by reformulating it in terms of the Fisher information 
matrix. Our methods are those of information geometry. The context is
general enough to include applications from outside statistics.

1 Introduction 

The literature contains several generalizations of the concept of models belonging
to the exponential family [1]. See for instance [..., 7]. The present
work gives such a definition in the context of an abstract information theory, which
is not necessarily based on probability. The main tools are those of information
geometry [5], in particular generalized divergence functions [3, ..., 11].
They can be used to define a generalized Fisher information matrix and generalized
exponential families (Definitions 1 and 2 in Section 4).

The motivation for the present work comes from physics. Applications of
the new definitions in the context of classical and of quantum mechanics will
follow in a separate publication [12]. A preliminary write-up of the present
work, including one non-statistical example, is found in [13].

The next section introduces a generalized divergence in an abstract setting.
The Bregman divergence, discussed in Section 3, is an important subcase.
Section 4 introduces our definitions of generalized Fisher information and of
generalized exponential families. Sufficient conditions for a model to belong to
a generalized exponential family follow in Section 5. The final two sections show how our
definitions relate to other generalizations found in the literature.

2 Definitions 

The abstract information framework $X, M, Q, \mu$ consists of a topological space
$X$, a differentiable manifold $M$, and a linear space $Q$ of real functions on $X$. In
addition there is given a continuous map $\mu : X \to M$. The space $X$ contains
data sets. The map $\mu$ associates a model point with each data set. The space
$Q$ contains questions about the data sets. To stress that $Q$ is not necessarily
an algebra, the notation $\langle x|q\rangle$ is used rather than $q(x)$ to evaluate $q \in Q$ at the
point $x \in X$. The constant function $1$ belongs to $Q$ and satisfies $\langle x|1\rangle = 1$ for
all $x$ in $X$.

A generalized divergence is a map $D : X \times M \to [0, \infty]$ satisfying the conditions

• (compatibility) for each $x$ in $X$, $\mu(x)$ is the unique element of $M$ minimizing
the divergence $m \mapsto D(x\|m)$;

• (consistency) for each $m$ in $M$, $0 = \min_x \{D(x\|m) : \mu(x) = m\}$.

The divergence is interpreted as the amount of information which is lost when 
the data set x is replaced by the model point m. 

Throughout the paper we assume that there exist functions $\xi : M \to \mathbb{R}$,
$\zeta : X \to \mathbb{R}$ and a diffeomorphism $L : M \to Q$ such that for all $x \in X$ and $m \in M$
one has

$$D(x\|m) = \xi(m) - \zeta(x) - \langle x|Lm\rangle. \tag{1}$$

From the compatibility condition follows the requirement that the map
$m \mapsto \xi(m) - \langle x|Lm\rangle$ is minimal when $m = \mu(x)$. From the positivity $D(x\|m) \ge 0$
and the consistency condition follows

$$\xi(m) = \sup_x \{\zeta(x) + \langle x|Lm\rangle\} = \sup_x \{\zeta(x) + \langle x|Lm\rangle : \mu(x) = m\}. \tag{2}$$

The function $\zeta$ has the meaning of an entropy function. The map $L$ is called
the logarithmic map because in the standard case (see below) it is essentially
the natural logarithm. The function $\xi$ is called the corrector [14]. We assume
in what follows that it is a differentiable function.

3 Bregman divergence 

The obvious example of our framework is that of a statistical model. Let $X$ be
the affine space of probability distributions over a finite alphabet $A$. A question
$q \in Q$ is a real function on $A$. The evaluation of $q$ at the point $x$ is given by

$$\langle x|q\rangle = \mathbb{E}_x q = \sum_{a \in A} x(a) q(a). \tag{3}$$

Let $\theta \in \Theta \subset \mathbb{R}^n \mapsto m_\theta \in X$ be a statistical model with sufficiently nice properties
so that the set

$$M = \{m_\theta : \theta \in \Theta\} \subset X \tag{4}$$

is a differentiable manifold.

A divergence of the Bregman type [...] is defined by

$$D(x\|m) = \sum_a \left[ F(x(a)) - F(m(a)) - (x(a) - m(a)) f(m(a)) \right] = \sum_a \int_{m(a)}^{x(a)} \mathrm{d}u \, [f(u) - f(m(a))], \tag{5}$$

where $F$ is any strictly convex function defined on the interval $(0, 1]$ and $f = F'$
is its derivative. The standard case, involving the Boltzmann-Gibbs-Shannon
entropy, is recovered when $F(u) = u \ln u$.

Assume that the function $F$ is twice differentiable. The logarithmic map
is given by $Lm(a) = f(m(a))$. The entropy function is $\zeta(x) = -\sum_a F(x(a))$.
The consistency condition (2) follows from the convexity of the function $F(u)$.
Indeed, it implies that

$$-F(x(a)) \le -F(m(a)) - (x(a) - m(a)) f(m(a)) \tag{6}$$

so that

$$\zeta(x) + \langle x|Lm\rangle = \sum_a [-F(x(a)) + x(a) f(m(a))] \le \sum_a [-F(m(a)) + m(a) f(m(a))] = \zeta(m) + \langle m|Lm\rangle. \tag{7}$$

This implies (2). The model map $\mu$ is given by

$$\mu(x) = \operatorname{argmin}_m \{\xi(m) - \langle x|Lm\rangle\}, \tag{8}$$

assuming existence and uniqueness of the minimum.
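
The relations of this section can be checked numerically. The sketch below is only an illustration: the two distributions `x` and `m` are made-up values on a three-letter alphabet, and the helper names (`bregman`, `zeta`, `pair`) are hypothetical. It evaluates (5) in the standard case $F(u) = u \ln u$, compares it with the Kullback-Leibler divergence, and verifies the inequality (7).

```python
import math

def bregman(x, m, F, f):
    """Bregman divergence (5): sum over a of F(x_a) - F(m_a) - (x_a - m_a) f(m_a)."""
    return sum(F(xa) - F(ma) - (xa - ma) * f(ma) for xa, ma in zip(x, m))

def kl(x, m):
    """Kullback-Leibler divergence between two strictly positive distributions."""
    return sum(xa * math.log(xa / ma) for xa, ma in zip(x, m))

# Standard case: F(u) = u ln u, so that f(u) = F'(u) = ln u + 1.
F = lambda u: u * math.log(u)
f = lambda u: math.log(u) + 1.0

# Illustrative (assumed) data set x and model point m.
x = [0.5, 0.3, 0.2]
m = [0.2, 0.5, 0.3]

zeta = lambda p: -sum(F(pa) for pa in p)                      # entropy function
pair = lambda p, n: sum(pa * f(na) for pa, na in zip(p, n))   # evaluation of L n at p

d = bregman(x, m, F, f)
lhs = zeta(x) + pair(x, m)   # left-hand side of (7)
rhs = zeta(m) + pair(m, m)   # right-hand side of (7)

print(d, kl(x, m), rhs - lhs)
```

For normalized `x` and `m` the three printed numbers should agree up to rounding, since the gap in (7) is exactly the divergence.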

4 Generalized exponential families 

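Introduce now coordinates $\theta \mapsto m_\theta$ for the model manifold $M$. Use the notations
$\xi(\theta) = \xi(m_\theta)$ and $D(x\|\theta) = D(x\|m_\theta)$. By assumption the functions $\xi(\theta)$ and
$\theta \mapsto \langle x|Lm_\theta\rangle$ are differentiable. Therefore the first derivatives

$$\frac{\partial}{\partial\theta^k} D(x\|\theta) = \frac{\partial\xi}{\partial\theta^k} - \frac{\partial}{\partial\theta^k} \langle x|Lm_\theta\rangle \tag{9}$$

vanish when $m_\theta = \mu(x)$.

Definition The matrix of second derivatives

$$I_{k,l}(x) = \left. \frac{\partial^2}{\partial\theta^k \partial\theta^l} D(x\|\theta) \right|_{m_\theta = \mu(x)} \tag{10}$$

is the generalized Fisher information matrix.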

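Proposition 4.1 The matrix $I_{k,l}(x)$ is covariant under coordinate transformations.

Proof

Let $\eta$ be a function of $\theta$. One calculates (summation over repeated indices is understood)

$$\frac{\partial}{\partial\theta^k} D(x\|\theta) = \left( \frac{\partial}{\partial\eta^m} D(x\|\theta) \right) \frac{\partial\eta^m}{\partial\theta^k}$$

and

$$\frac{\partial^2}{\partial\theta^k \partial\theta^l} D(x\|\theta) = \left( \frac{\partial^2}{\partial\eta^m \partial\eta^n} D(x\|\theta) \right) \frac{\partial\eta^m}{\partial\theta^k} \frac{\partial\eta^n}{\partial\theta^l} + \left( \frac{\partial}{\partial\eta^m} D(x\|\theta) \right) \frac{\partial^2 \eta^m}{\partial\theta^k \partial\theta^l}. \tag{11}$$

The latter term vanishes when $m_\theta = \mu(x)$. What remains is covariant under
coordinate transformations.

□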

Definition The model $X, M, Q, D$ belongs to a generalized exponential family
if the Fisher information matrix $I_{k,l}(x)$, defined by (10), is constant on the fibers
$\mathcal{F}_m = \{x : \mu(x) = m\}$.

The constant value is then denoted $I_{k,l}(\theta)$.

A justification of this definition follows later on from the study of the definition
in the familiar context of divergences of the Bregman type. The main
advantage of the above definition is that it does not specify a particular choice
of coordinates. That the Fisher information matrix is constant on the fiber $\mathcal{F}_m$
is a scaling property. It means that locally the manifold always looks the same,
independent of the point of view $x \in \mathcal{F}_m$.
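
As an illustration of this definition, the following sketch checks numerically that the second derivative of the divergence takes the same value at two data sets lying on one fiber. Everything specific in it is an assumption made for the example and not taken from the text: a one-parameter exponential model on a three-letter alphabet with statistic `T`, the Kullback-Leibler divergence as $D$, a fiber direction `v`, and central finite differences instead of exact derivatives.

```python
import math

T = [0.0, 1.0, 2.0]  # assumed statistic on a three-letter alphabet

def model(theta):
    """Assumed model m_theta(a) proportional to exp(theta * T(a))."""
    w = [math.exp(theta * t) for t in T]
    z = sum(w)
    return [wa / z for wa in w]

def divergence(x, theta):
    """D(x || theta), here the Kullback-Leibler divergence from x to m_theta."""
    m = model(theta)
    return sum(xa * math.log(xa / ma) for xa, ma in zip(x, m))

def fisher(x, theta, h=1e-4):
    """Central finite-difference estimate of the second derivative in theta."""
    return (divergence(x, theta + h) - 2.0 * divergence(x, theta)
            + divergence(x, theta - h)) / h ** 2

theta = 0.3
m = model(theta)

# v sums to zero and has zero T-average, so m + eps * v keeps the same
# expectation of T and therefore projects onto the same model point.
v = [1.0, -2.0, 1.0]
x1 = m
x2 = [ma + 0.05 * va for ma, va in zip(m, v)]

mean = sum(ma * t for ma, t in zip(m, T))
variance = sum(ma * (t - mean) ** 2 for ma, t in zip(m, T))

i1 = fisher(x1, theta)
i2 = fisher(x2, theta)
print(i1, i2, variance)
```

In this assumed setting both estimates should agree with the variance of `T` under the model, which is the usual Fisher information of a one-parameter exponential family.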

5 Sufficient conditions 

It is natural to define a divergence between model points by

$$D(m\|n) = \inf_x \{D(x\|n) : \mu(x) = m\}. \tag{12}$$

It satisfies $D(m\|n) \ge 0$. Because of the special form (1) of the divergence there
follows

$$D(m\|n) = \xi(n) - \sup_x \{\zeta(x) + \langle x|Ln\rangle : \mu(x) = m\}. \tag{13}$$

Using the consistency condition (2) one can write

$$D(m\|n) = \sup_x \{\zeta(x) + \langle x|Ln\rangle : \mu(x) = n\} - \sup_x \{\zeta(x) + \langle x|Ln\rangle : \mu(x) = m\}. \tag{14}$$

In particular, $D(m\|m) = 0$ holds.

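Theorem 5.1 Assume that the following Pythagorean relation holds

$$x \in \mathcal{F}_\theta \;\Rightarrow\; D(x\|\theta) + D(\theta\|\eta) = D(x\|\eta). \tag{15}$$

Then the model belongs to the generalized exponential family.

Proof

From (15) follows

$$I_{k,l}(x) = \left. \frac{\partial^2}{\partial\eta^k \partial\eta^l} D(x\|\eta) \right|_{\eta=\theta} = \left. \frac{\partial^2}{\partial\eta^k \partial\eta^l} D(\theta\|\eta) \right|_{\eta=\theta}. \tag{16}$$

This shows that $I_{k,l}(x)$ is constant along the fiber $\mathcal{F}_\theta$.

□

The Pythagorean equality (15) expresses the intuition that the projection $\mu$
on the model manifold $M$ is orthogonal.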
Theorem 5.2 If the logarithmic map is of the form

$$Lm_\theta = -\alpha(\theta) - q_0 - \eta^k(\theta) q_k \tag{17}$$

with functions $\alpha$ and $\eta^k$, and questions $q_0$, $q_k$ in $Q$, then the Pythagorean relation
(15) is satisfied. In particular, the model belongs to a generalized exponential
family.

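Proof

Introduce the abbreviation $\Phi = \xi + \alpha$. From the definition of $\xi$ follows that

$$\Phi(\theta) = \sup_x \{\zeta(x) - \langle x|q_0\rangle - \eta^k(\theta) \langle x|q_k\rangle\}. \tag{18}$$

Hence $\Phi$ depends on $\theta$ only via the functions $\eta^k$. In combination with

$$D(x\|\theta) = \Phi(\theta) - \zeta(x) + \langle x|q_0\rangle + \eta^k(\theta) \langle x|q_k\rangle \tag{19}$$

and the assumption that for each $x$ there is a unique $\theta$ minimizing $D(x\|\theta)$ one
concludes that the map $\theta \mapsto \eta$ is invertible. This observation is essential to
conclude that $\langle x|q_k\rangle$ is constant along the fibers $\mathcal{F}_\theta$. One has indeed for $x \in \mathcal{F}_\theta$

$$0 = \frac{\partial}{\partial\theta^k} D(x\|\theta) = \frac{\partial\eta^m}{\partial\theta^k} \left[ \frac{\partial\Phi}{\partial\eta^m} + \langle x|q_m\rangle \right] \tag{20}$$

so that

$$\langle x|q_m\rangle = -\frac{\partial\Phi}{\partial\eta^m}. \tag{21}$$

Now calculate, still assuming that $x \in \mathcal{F}_\theta$, and using that $\langle x|q_k\rangle$ is constant
along $\mathcal{F}_\theta$,

$$\begin{aligned}
D(\theta\|\theta') &= \inf_x \{D(x\|\theta') : x \in \mathcal{F}_\theta\} \\
&= \Phi(\theta') - \sup_x \{\zeta(x) - \langle x|q_0\rangle - \eta^k(\theta') \langle x|q_k\rangle : x \in \mathcal{F}_\theta\} \\
&= \Phi(\theta') - \sup_x \{\zeta(x) - \langle x|q_0\rangle : x \in \mathcal{F}_\theta\} + \eta^k(\theta') \langle x|q_k\rangle \\
&= \Phi(\theta') - \Phi(\theta) + (\eta^k(\theta') - \eta^k(\theta)) \langle x|q_k\rangle \\
&= D(x\|\theta') - D(x\|\theta).
\end{aligned} \tag{22}$$

The fourth line uses the consistency condition in the form $D(\theta\|\theta) = 0$.
This shows the Pythagorean relation.

□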
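
The Pythagorean relation can be illustrated numerically. The sketch below rests on assumptions that are not part of the text: a one-parameter exponential model on a three-letter alphabet with statistic `T`, the Kullback-Leibler divergence as $D$, and the identification of the divergence between model points with the Kullback-Leibler divergence between them, which is what (22) gives when evaluated at the fiber point $x = m_\theta$. It moves `x` along the fiber and checks that the difference of the two divergences does not change.

```python
import math

T = [0.0, 1.0, 2.0]  # assumed statistic on a three-letter alphabet

def model(theta):
    """Assumed model m_theta(a) proportional to exp(theta * T(a))."""
    w = [math.exp(theta * t) for t in T]
    z = sum(w)
    return [wa / z for wa in w]

def kl(x, m):
    """Kullback-Leibler divergence between two strictly positive distributions."""
    return sum(xa * math.log(xa / ma) for xa, ma in zip(x, m))

theta, eta = 0.3, -0.8
m_theta, m_eta = model(theta), model(eta)

# Direction tangent to the fiber: zero total mass and zero T-average.
v = [1.0, -2.0, 1.0]

gaps = []
for eps in (-0.05, 0.0, 0.05, 0.1):
    x = [ma + eps * va for ma, va in zip(m_theta, v)]
    # D(x || eta) - D(x || theta) should not depend on the fiber point x.
    gaps.append(kl(x, m_eta) - kl(x, m_theta))

d_model = kl(m_theta, m_eta)  # divergence between the two model points
print(gaps, d_model)
```

All entries of `gaps` should coincide with `d_model` up to rounding, which is the content of the Pythagorean relation in this assumed example.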



6 Justification 

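We now return to Section 3, which deals with the Bregman divergence. In this
context we give an explicit characterisation of the generalized exponential family
and show that it is satisfied by the more common definition.

Taking derivatives of (5) yields

$$\frac{\partial}{\partial\theta^k} D(x\|\theta) = -\sum_a [x(a) - m_\theta(a)] \frac{\partial}{\partial\theta^k} f(m_\theta(a)) \tag{23}$$

and, assuming $\mu(x) = m_\theta$,

$$I_{k,l}(x) = \sum_a f'(m_\theta(a)) \frac{\partial m_\theta(a)}{\partial\theta^k} \frac{\partial m_\theta(a)}{\partial\theta^l} - \sum_a [x(a) - m_\theta(a)] \frac{\partial^2}{\partial\theta^k \partial\theta^l} f(m_\theta(a)). \tag{24}$$

Independence of $x$ along $\mathcal{F}_\theta$ implies

$$I_{k,l}(\theta) = \sum_a f'(m_\theta(a)) \frac{\partial m_\theta(a)}{\partial\theta^k} \frac{\partial m_\theta(a)}{\partial\theta^l} \tag{25}$$

and

$$\sum_a [x(a) - m_\theta(a)] \frac{\partial^2}{\partial\theta^k \partial\theta^l} f(m_\theta(a)) = 0 \quad \text{for all } x \in \mathcal{F}_\theta. \tag{26}$$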

One concludes that the model belongs to the generalized exponential family if
the set of equations (26) holds for all $x$ satisfying $x \in \mathcal{F}_\theta$ and the normalization
condition $\sum_a x(a) = 1$. With $f'(u) = 1/u$ expression (25) reduces to the
standard expression for the Fisher information matrix.

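The obvious solution of (26) is that there exist coordinates $\eta(\theta)$ such that

$$\frac{\partial^2}{\partial\eta^k \partial\eta^l} f(m_\theta(a)) \quad \text{does not depend on } a. \tag{27}$$

Indeed, (26) can be written as

$$0 = \sum_a [x(a) - m_\theta(a)] \left( \frac{\partial^2}{\partial\eta^m \partial\eta^n} f(m_\theta(a)) \right) \frac{\partial\eta^m}{\partial\theta^k} \frac{\partial\eta^n}{\partial\theta^l} + \sum_a [x(a) - m_\theta(a)] \left( \frac{\partial}{\partial\eta^m} f(m_\theta(a)) \right) \frac{\partial^2 \eta^m}{\partial\theta^k \partial\theta^l}. \tag{28}$$

Because of the ansatz (27) and the normalization of $x$ and $m_\theta$ the former term vanishes.
The latter vanishes because $x \in \mathcal{F}_\theta$.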

The requirement (27) is equivalent to the existence of functions $q_0$ and
$q_k$ such that for all $a$ and one fixed $b$

$$f(m_\theta(a)) - f(m_\theta(b)) = q_0(a) + \eta^k q_k(a). \tag{29}$$

One obtains

$$\langle x|f(m_\theta)\rangle = f(m_\theta(b)) + \langle x|q_0\rangle + \eta^k \langle x|q_k\rangle. \tag{30}$$

This expression is of the form (17) (note that $f(m_\theta(a)) = Lm_\theta(a)$).

7 Discussion 

We propose to replace current definitions of generalized exponential families by
one formulated in terms of a generalized Fisher information (see Section 4).
The new definition can be used in a more abstract setting of information theory,
one which does not necessarily rely on probability theory.

The central tool of the present paper is an asymmetric divergence D(x\\m) 
between data sets x and model points m. Divergences of this kind occur in 
game theory — see for instance Section 8 of [3]. They generalize the notion of 
a Bregman divergence [5]. 

The notion of a generalized exponential family is usually formulated directly
in terms of the function $f$ appearing in the generalized divergence by an expression
similar to (30). We propose here to use the divergence in the first place to
define a generalized Fisher information matrix. The latter is then used to define
the generalized exponential families.

In [2] the function $f$, occurring in (30) and defining the logarithmic map $L$,
is assumed to be of the form

$$f(u) = \int_1^u \mathrm{d}v \, \frac{1}{\phi(v)} \tag{31}$$

with $\phi$ positive and increasing, and is called a deformed logarithm. The $\phi$-deformed
exponential family is then defined by an expression of the form (17).
See also [1]. The special case with $\phi(v) = v^q$ is the $q$-deformed logarithm
considered in non-extensive statistical physics [..., 16]. The corresponding
exponential families coincide with Amari's $\alpha$-families [...].
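
A small numerical sketch may help to see what the deformed logarithm looks like in the $q$-deformed case. It assumes the integral runs from 1 to $u$ of $1/\phi(v)$, uses a plain midpoint rule for the integration, and compares the result with the closed form $(u^{1-q} - 1)/(1-q)$ that follows for $\phi(v) = v^q$ with $q \ne 1$. The function names and the sample values of `u` and `q` are chosen for this illustration only.

```python
import math

def deformed_log(u, phi, steps=20000):
    """Midpoint-rule approximation of the integral from 1 to u of dv / phi(v)."""
    h = (u - 1.0) / steps  # negative when u < 1, which orients the integral correctly
    return sum(h / phi(1.0 + (i + 0.5) * h) for i in range(steps))

def q_log(u, q):
    """Closed form of the deformed logarithm for phi(v) = v**q, valid for q != 1."""
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

q = 0.5
u = 0.4
numeric = deformed_log(u, lambda v: v ** q)
exact = q_log(u, q)

# With phi(v) = v the deformed logarithm is the natural logarithm.
natural = deformed_log(2.0, lambda v: v)

print(numeric, exact, natural, math.log(2.0))
```

The deformed logarithm vanishes at $u = 1$ and is increasing for any positive $\phi$, as the natural logarithm is.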

An alternative for the Bregman divergence is the U-divergence [...]. In our
notations it reads

$$D_U(x\|m) = \sum_a \int_{f(x(a))}^{f(m(a))} \mathrm{d}u \, [g(u) - x(a)], \tag{32}$$

where $U$ is a convex increasing function, $g = U'$ and $f$ is the inverse function
of $g$ (note that $f$ is the deformed logarithm, $g$ the deformed exponential function
in the language of non-extensive statistical physics). The $U$-model is then
introduced in [3] as a generalization of the exponential model and is defined by
a relation of the form (17).
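
As a sanity check of the U-divergence, one may take $U = \exp$, so that $g = \exp$ and $f = \ln$. The sketch below assumes that the integral in (32) runs from $f(x(a))$ to $f(m(a))$, in which case it has the closed form $U(f(m(a))) - U(f(x(a))) - x(a)\,[f(m(a)) - f(x(a))]$ for each letter $a$. The distributions `x` and `m` are made-up values and the helper names are hypothetical.

```python
import math

def u_divergence(x, m, U, f):
    """Closed form of (32), assuming integration from f(x_a) to f(m_a)."""
    return sum(U(f(ma)) - U(f(xa)) - xa * (f(ma) - f(xa))
               for xa, ma in zip(x, m))

def kl(x, m):
    """Kullback-Leibler divergence between two strictly positive distributions."""
    return sum(xa * math.log(xa / ma) for xa, ma in zip(x, m))

# Illustrative (assumed) data set x and model point m.
x = [0.5, 0.3, 0.2]
m = [0.2, 0.5, 0.3]

# U = exp gives g = exp and f = ln.
d_u = u_divergence(x, m, math.exp, math.log)
print(d_u, kl(x, m))
```

For normalized `x` and `m` the two printed values should coincide, so that in this case the U-divergence and the Bregman divergence with $F(u) = u \ln u$ describe the same quantity.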